In genetic association studies, a single marker is often associated with multiple, correlated phenotypes
(e.g., obesity and cardiovascular disease, or nicotine dependence and lung cancer). A
pervasive question is then whether that marker has independent effects on all phenotypes. In 
this article, we address this question by assessing whether there is a direct genetic effect on one 
phenotype that is not mediated through the other phenotypes. In particular, we investigate how 
to identify and estimate such direct genetic effects on the basis of (matched) case-control data. 
We discuss conditions under which such effects are identifiable from the available (matched) 
case-control data. We find that direct genetic effects are sometimes estimable via standard regression
methods, and sometimes via a more general G-estimation method, which has previously
been proposed for random samples and unmatched case-control studies [37, 39] and is here extended
to matched case-control studies. The results are used to assess whether the FTO gene
is associated with myocardial infarction other than via an effect on obesity. 



1 Introduction 

Associations of a genetic variant with a primary phenotype can be difficult to interpret 
when one considers the likely presence of correlated phenotypes. The genetic association 
may then be the indirect result of genetic effects on a correlated phenotype, which subsequently
affect the primary phenotype. For instance, Chanock and Hunter [5] discuss
three genetic association studies which identified an association between a genetic variation
on chromosome 15 and the risk of lung cancer, but the studies disagree on whether
the link is direct or mediated through nicotine dependence. Addressing this question may
be important to a better understanding of the underlying causal mechanism. This article
addresses the general problem of inferring the direct effect of a marker X on a trait Y
(e.g., lung cancer), controlling for a correlated trait M (e.g., nicotine dependence), which
we will refer to as a mediating variable or mediator.

Vansteelandt et al. [39] consider this problem in the context of prospective studies of
genetic association. Motivated by the frequent use of ascertained samples in those studies, 
in this paper we extend the method to matched case-control studies. We show that 
case-control sampling seriously complicates the identification of direct genetic effects. 
Progress can be made within certain classes of statistical models and under specific no 
unmeasured confounding assumptions. In particular, we find that, under very restrictive 
conditions, direct effects are estimable from case-control data by using standard regression 
methods, and that they are estimable under more lenient conditions by using special 
G-estimation methods [37], which we here extend to matched case-control data. In this 
paper, the required conditions for estimability are unambiguously expressed as conditional 
independence relationships between problem variables, which we can check on a causal 
diagram [8, 17, 25]. We illustrate the method with the aid of a motivating study, in which
we use matched case-control data to assess whether variation in the chromosomal region 
of the FTO gene causally affects susceptibility to myocardial infarction other than via an 
increase in body mass. 

2 Motivating study 

FTO is a large gene on chromosome 16 that is highly expressed in the hypothalamic
nuclei that control eating behaviour in mice [13]. The first intron of FTO harbours the 
single nucleotide polymorphism (SNP) rs9939609, associated with body mass [35] and 
myocardial infarction [Tl[T3l[Tll|28l|32l[33lllT]. A simple, tentative, interpretation of the 
evidence is that genetic variation represented (or reflected) by rs9939609 amplifies the
obesity-inducing effect of FTO, thereby indirectly affecting susceptibility to infarction.
However, an analysis of the data of Section 6 based on the method we propose in Section 5
shows that the effect of rs9939609 on infarction is not entirely mediated by body mass. 
This finding points to a different theory of the role of rs9939609 in the development of an 
infarction. 

Figure 1a shows a causal diagram representation of the problem. Causal diagrams are
reviewed in Appendix 1. In the diagram, we let GENO denote genetic variation respon- 
sible for changes in risk of infarction and correlated with rs9939609. We let MI denote 
occurrence or nonoccurrence of infarction. Let DEMO represent the following set of
variables: sex, geographical area of origin and profession. Let BMI represent the body
mass index. Let BEHAVE represent frequent physical exercise and drinking habit. According
to the diagram, the correlation between BEHAVE and MI is taken to be, in
part, induced by shared genetic or environmental factors, UNOBSERVED. The missing
UNOBSERVED → BMI arrow represents the assumption that, conditionally on
BEHAVE and DEMO, no unobserved risk factors for obesity are associated with infarction.
Application of the proposed method to the data of Section 6, under the assumptions
of Figure 1a, shows that the causal effect of GENO on MI is not entirely mediated by
BMI, in the sense that a (hypothetical) intervention that fixes the value of BMI would
not completely block the effect exerted on MI by a (hypothetical) intervention on GENO.
This finding points to new hypotheses about the role of rs9939609 in susceptibility to MI.
At the end of this paper we discuss the biological implications of this finding in the light 
of recent experimental research evidence. 




Figure 1. (a) Causal diagram for our motivating study; (b) same diagram, augmented with
intervention indicators, as explained in Section 3.

3 Controlled direct effects

More generally, let X denote the genetic variation of interest, and let the binary variable Y
indicate occurrence (Y = 1) or non-occurrence (Y = 0) of the disease. Let M denote a
set of variables along the causal path from X to Y. Define the direct effect of X on Y,
controlled for M, to be the effect exerted on Y by an intervention that changes the value
of X from some reference value $x_0$ to $x_1$, while keeping M fixed at some reference value,
$m_0$ P^ISD] . To formalize this concept, we need to represent the idea of "intervention".
This means distinguishing between the "observational" distribution of the data we are
analyzing, $P_{\emptyset}$, and the distribution, $P_{xm}$, of the data we would have obtained had we
fixed X to some value x and/or M to some value m. Following Dawid [8], we label
these different distributions by an intervention indicator $\sigma_X$ and an intervention indicator
$\sigma_M$, where, for $H \in \{X, M\}$, the symbol $\sigma_H = \emptyset$ indicates that the value of H is observed
passively, and the symbol $\sigma_H = h$ indicates that H is set to h by an intervention. Thus,
for a generic variable W, the symbol $P(Y = 1 \mid \sigma_X = x, \sigma_M = m, W = w)$ denotes the
probability of occurrence of the outcome event, conditional on observing W = w, when
we forcefully set X to x and M to m. The direct effect of X on Y, controlled for M and
conditional on a generic set W of observed variables, can now be measured in terms of
the (causal conditional) relative risk

\[
\frac{P(Y = 1 \mid \sigma_X = x_1, \sigma_M = m_0, W)}{P(Y = 1 \mid \sigma_X = x_0, \sigma_M = m_0, W)}, \tag{1}
\]

or in terms of the (causal conditional) odds ratio

\[
\frac{\mathrm{odds}(Y = 1 \mid \sigma_X = x_1, \sigma_M = m_0, W)}{\mathrm{odds}(Y = 1 \mid \sigma_X = x_0, \sigma_M = m_0, W)}, \tag{2}
\]

where $\mathrm{odds}(Y = 1 \mid \sigma_X = x, \sigma_M = m, W = w) = P(Y = 1 \mid \sigma_X = x, \sigma_M = m, W = w) / P(Y = 0 \mid \sigma_X = x, \sigma_M = m, W = w)$.
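
The two effect measures can be illustrated numerically. The following is a minimal sketch, assuming two made-up interventional risks within one stratum of W; the function names and the numbers are hypothetical and are not taken from the study.

```python
def odds(p):
    """Odds corresponding to a probability p strictly between 0 and 1."""
    return p / (1.0 - p)


def relative_risk(p1, p0):
    """Ratio of two risks, as in the relative risk measure of this section."""
    return p1 / p0


def odds_ratio(p1, p0):
    """Ratio of two odds, as in the odds ratio measure of this section."""
    return odds(p1) / odds(p0)


# Hypothetical risks of disease when X is set to x1 or x0 and M is held at m0.
p_x1 = 0.012
p_x0 = 0.016

rr = relative_risk(p_x1, p_x0)
or_ = odds_ratio(p_x1, p_x0)
print(f"relative risk = {rr:.4f}, odds ratio = {or_:.4f}")
```

With risks this small the two measures nearly coincide, which is the practical content of the rare disease assumption invoked later for case-control data.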

Because our data are generated from $P_{\emptyset}$, i.e., conditional on $\sigma_X = \emptyset$, $\sigma_M = \emptyset$, they will,
in general, be uninformative about the interventional probabilities involved in the direct
effect of interest, be it in the form (1) or in the form (2). Does this mean we can never
estimate a direct effect on the basis of observational data? Luckily, no. Estimation is
possible in special situations, under identifiability conditions studied in the next section.
As we shall see, these conditions can be expressed through the language of conditional
independence [10], extended by Dawid to accommodate intervention indicators [9]. An
important tool in our subsequent discussion is the causal diagram extended (augmented)
to incorporate intervention indicators in the form of additional nodes sending arrows into
their corresponding variables, as in [9]. One example is the causal diagram of Figure 1b,
which extends the diagram of Figure 1a by adding nodes to represent the intervention
indicators for variables X and M.

Figure 2. This diagram has been obtained from Figure 1b by adding the selection indicator
node, S = 1, as explained in Section 5. In the diagram, this node receives arrows from MI,
DEMO and BEHAVE. This represents the assumption that the probability of a generic
individual of the study cohort being sampled depends on (MI, DEMO, BEHAVE) while
being, conditional on these variables, independent of GENO. Its dependence on GENO would
violate condition (15).

4 Estimation from random population samples 

Important results on the identifiability of controlled direct effects have been obtained by 
Robins, Greenland, Didelez, Dawid, Geneletti and Pearl [UlEniESlEO] under the assump- 
tion that the population sample is random. These results are now summarized, with 
the involved assumptions expressed in the form of conditional independence conditions 
between problem variables. 

If there exists a (possibly empty) set W of observed variables such that, conditionally on 
W , there is no unobserved confounding of the relationship between (X, M) and Y , then 
the direct effect of X on Y , controlling for M, is identifiable from random population 
samples and estimable via standard regression of Y on X, M and W. The stated condition
is equivalent to asking that W is not a descendant of either M or X, and that the 
distribution of Y given (X, M, W) is the same, regardless of the way the values of X and 
M are generated, be it observationally or by forceful intervention, formally: 



\[
W \perp\!\!\!\perp (\sigma_X, \sigma_M), \tag{3}
\]
\[
Y \perp\!\!\!\perp (\sigma_X, \sigma_M) \mid (X, M, W). \tag{4}
\]

In fact, it follows from (3)-(4) that

\[
P(Y = 1 \mid \sigma_X = x, \sigma_M = m, W) = P(Y = 1 \mid X = x, M = m, W),
\]

where the right-hand side can be obtained as the fitted value from a (logistic) regression
model, and hence an estimate of the causal conditional relative risk (1) can also
be obtained. Conditions (3)-(4) can be checked on an augmented causal diagram by the
d-separation criterion [15][IM] reviewed in Appendix 1, or the equivalent moralisation criterion
[22].
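
The step from the two conditional independence conditions of this section to the identity between interventional and observational probabilities can be spelled out as follows. This is a sketch in the notation of Section 3 rather than a derivation quoted from the cited papers; it assumes that an intervention which sets X to x and M to m forces X = x and M = m.

\[
\begin{aligned}
P(Y = 1 \mid \sigma_X = x, \sigma_M = m, W)
&= P(Y = 1 \mid \sigma_X = x, \sigma_M = m, X = x, M = m, W) \\
&= P(Y = 1 \mid \sigma_X = \emptyset, \sigma_M = \emptyset, X = x, M = m, W) \\
&= P(Y = 1 \mid X = x, M = m, W).
\end{aligned}
\]

The first equality holds because, under the interventional regime, X = x and M = m with probability one. The second uses the independence of Y from the intervention indicators given (X, M, W). The third restates the convention that probabilities without intervention indicators refer to the observational regime. The independence of W from the intervention indicators ensures, roughly, that the strata of W mean the same thing in every regime.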

EXAMPLE 1: the diagram of Figure 1b contains causal paths from GENO to MI that do not
involve BMI. It thus makes sense to test for a direct effect of GENO on MI, controlling
for BMI. Conditions (3)-(4) for this test to be approachable via standard regression imply
the existence of a (possibly empty) set of variables W that satisfies:

\[
W \perp\!\!\!\perp (\sigma_{\mathrm{GENO}}, \sigma_{\mathrm{BMI}}), \tag{5}
\]
\[
\mathrm{MI} \perp\!\!\!\perp (\sigma_{\mathrm{GENO}}, \sigma_{\mathrm{BMI}}) \mid (\mathrm{GENO}, \mathrm{BMI}, W). \tag{6}
\]

In order to satisfy (5), the set W must not contain a member of BEHAVE. But then, because
MI and $\sigma_{\mathrm{BMI}}$ are d-connected when BMI is in the conditioning set and BEHAVE
is not (in accord with the theory of Appendix 1), condition (6) will inevitably be violated,
and we conclude that, in this example, the direct effect of interest cannot be estimated by
using standard regression.

Estimation of the direct effect of X on Y, controlled for M, from prospective observational
data, is possible under more lenient conditions than (3)-(4), although this will occasionally
require standard regression to be abandoned in favour of the more general method of G-computation
[30]. These more lenient conditions require that there be a (possibly empty)
set W of non-causal successors of X such that, conditional on W, there is no confounding
between Y and X, and a (possibly empty) set Z of non-causal successors of M such that,
conditional on (X, Z, W), there is no confounding between Y and M. All this is formally
expressed by the following conditions:

\[
W \perp\!\!\!\perp \sigma_X, \tag{7}
\]
\[
Z \perp\!\!\!\perp \sigma_M, \tag{8}
\]
\[
Y \perp\!\!\!\perp \sigma_X \mid (X, W), \tag{9}
\]
\[
Y \perp\!\!\!\perp \sigma_M \mid (X, M, Z, W), \tag{10}
\]

which are similar to those given in [11]. Various authors have discussed G-computation pTl
|23l[26l[29l|3nil36] or G-estimation [20, 37, 39] of controlled direct effects from a random
population sample in such settings. These authors use assumptions (7)-(10), although
they sometimes adopt a different "language" to express them.
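
For a discrete Z, G-computation can be sketched as a standardisation over Z. The source does not spell out the formula; the version below is the standard g-formula for a controlled direct effect with an exposure-induced confounder, $P(Y = 1 \mid \sigma_X = x, \sigma_M = m, W) = \sum_z P(Y = 1 \mid X = x, M = m, Z = z, W)\, P(Z = z \mid X = x, W)$, evaluated within a single stratum of W. The function name and all probabilities are hypothetical.

```python
def g_formula_risk(p_y, p_z, x, m):
    """Standardised risk: sum over z of P(Y=1 | x, m, z) * P(Z=z | x).

    p_y maps (x, m, z) to P(Y=1 | X=x, M=m, Z=z) and p_z maps x to a
    dictionary {z: P(Z=z | X=x)}; both refer to one fixed stratum of W.
    """
    return sum(p_y[(x, m, z)] * prob_z for z, prob_z in p_z[x].items())


# Made-up inputs for a binary X, a binary Z and the mediator held at m = 0.
p_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}
p_y = {
    (0, 0, 0): 0.01,
    (0, 0, 1): 0.02,
    (1, 0, 0): 0.02,
    (1, 0, 1): 0.04,
}

risk_x0 = g_formula_risk(p_y, p_z, x=0, m=0)
risk_x1 = g_formula_risk(p_y, p_z, x=1, m=0)
direct_rr = risk_x1 / risk_x0
print(f"controlled direct effect (relative risk) = {direct_rr:.3f}")
```

Note that Z is averaged over its distribution given X rather than conditioned on in the final contrast, which is what distinguishes this calculation from a regression that simply adjusts for Z.
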

EXAMPLE 2: with reference to the causal diagram of Figure 1b, if we specify W = DEMO and
Z = BEHAVE, then conditions (7)-(10) can be written as:

\[
\mathrm{DEMO} \perp\!\!\!\perp \sigma_{\mathrm{GENO}}, \tag{11}
\]
\[
\mathrm{BEHAVE} \perp\!\!\!\perp \sigma_{\mathrm{BMI}}, \tag{12}
\]
\[
\mathrm{MI} \perp\!\!\!\perp \sigma_{\mathrm{GENO}} \mid (\mathrm{DEMO}, \mathrm{GENO}), \tag{13}
\]
\[
\mathrm{MI} \perp\!\!\!\perp \sigma_{\mathrm{BMI}} \mid (\mathrm{GENO}, \mathrm{BMI}, \mathrm{BEHAVE}, \mathrm{DEMO}). \tag{14}
\]

Conditions (11)-(12) are satisfied because neither DEMO is a descendant of GENO, nor
BEHAVE a descendant of BMI. Condition (13) is satisfied because MI and $\sigma_{\mathrm{GENO}}$ are
d-separated in Figure 1b when DEMO is in the conditioning set. Finally, condition (14) is
satisfied because, as shown in Appendix 1, nodes MI and $\sigma_{\mathrm{BMI}}$ are d-separated in Figure
1b if BMI and BEHAVE are in the conditioning set. We conclude that the direct effect
of GENO on MI, controlling for BMI, is estimable by G-computation from prospective
observational data, under the assumptions of Figure 1b.



Conditions (7)-(10) do not prevent Z from being a descendant of X, in which case the conditioning
on Z will, in a general prospective study, create a spurious association between
X and Y, even in absence of the direct effect we wish to assess [HI [23]. This "collider-stratification
bias" will prevent standard regression, but not necessarily G-computation
or G-estimation, from correctly estimating the direct effect of X on Y, controlling for M,
as shown in [39].



5 Estimation from matched case-control studies 

Let us now shift attention to the estimation of controlled direct effects in the context of
a retrospective design. This is, even under the general conditions (7)-(10), a complicated
task, one reason being the possible ("exposure-induced mediator-outcome") confounding
induced by statistical dependence between Z and $\sigma_X$ (quite possible under (7)-(10)). The
literature on estimating controlled direct effects from retrospective designs in presence
of this type of confounding is, to the best of our knowledge, very limited so far. G-estimation
approaches to this problem in the context of unmatched case-control studies
have been suggested by Vansteelandt in [37] and [38]. The latter paper uses G-estimation
in combination with logistic regression. In this section, we shall present an approach to
the problem that works with matched case-control studies.

We start by including in the causal diagram a special node S, called the selection indicator,
to account for the non-random sampling involved in case-control studies. This is
exemplified in Figure 2. The value S = 1 indicates that the individual has been selected
from the underlying study cohort for inclusion in the study, as in [16, 19]. Implicit in a
case-control study is the fact that the selection event, S, depends on the outcome, Y, and
this is why we have the Y → S arrow in the diagram. Data analysis is (by tautology)
performed conditional on S = 1. Suppose that the usual "rare disease assumption" is
valid, and that the "collapsibility" condition

\[
X \perp\!\!\!\perp S \mid (Y, M, W), \tag{15}
\]

is satisfied, which ensures that the conditional odds ratio based on $\mathrm{odds}(Y = 1 \mid X = x, M = m, W)$
is not affected by the retrospective sampling [12, 10]. In those situations where the above
condition is satisfied together with (3)-(4), a standard regression approach to the case-control
study will work (conditional logistic regression being one option when the case-control
study is matched). In the following, we are concerned with the more difficult
situation of a matched case-control study where condition (15), but not (3)-(4), holds.
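
The invariance of the odds ratio under outcome-dependent sampling can be checked numerically. The sketch below assumes the simplest case, in which selection depends on Y only, so that the collapsibility condition holds trivially; the risks and sampling fractions are invented for illustration, and the function names are hypothetical.

```python
def odds(p):
    """Odds corresponding to a probability p strictly between 0 and 1."""
    return p / (1.0 - p)


def risk_in_sample(p, s_case, s_control):
    """P(Y=1 | X=x, S=1) when diseased subjects are sampled with probability
    s_case and disease-free subjects with probability s_control."""
    return p * s_case / (p * s_case + (1.0 - p) * s_control)


# Hypothetical population risks for X = 1 and X = 0 (a rare disease).
p1, p0 = 0.004, 0.002
# Hypothetical sampling fractions: most cases, very few non-cases.
s_case, s_control = 0.9, 0.001

or_population = odds(p1) / odds(p0)
or_sample = odds(risk_in_sample(p1, s_case, s_control)) / odds(
    risk_in_sample(p0, s_case, s_control)
)
rr_population = p1 / p0
print(or_population, or_sample, rr_population)
```

The sampling fractions multiply the odds in both exposure groups by the same factor, so they cancel in the ratio; and because the disease is rare, the odds ratio is also close to the population relative risk.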

Hence suppose that cases and controls have been 1-to-1 matched with respect to a set
W of variables that satisfies conditions (7)-(10). Let the W-matched pairs be indexed by
i (with i = 1, ..., n) and let the generic notation $G^{(ij)}$ denote the value of a variable of
interest, G, for the jth member of pair i (j = 1 for the case, j = 0 for the control).
Assume the event Y = 1 is rare (which is often
a main motivation for the choice of a retrospective design), and that the following model
is true:

\[
\frac{E(Y \mid \sigma_X = x, \sigma_M = m, W, Z)}{E(Y \mid \sigma_X = 0, \sigma_M = 0, W, Z)} = \exp(\psi x + \gamma m), \tag{16}
\]

where expectations $E(\cdot)$ refer to the population distribution. Then we show in Appendix
2 that the data will approximately satisfy:

\[
E^{*}\left\{ \left(X^{(i1)} - X^{(i0)}\right) \exp\left(-\psi X^{(i1)} - \gamma M^{(i1)}\right) \right\} = 0, \tag{17}
\]

where the expectation $E^{*}(\cdot)$ refers to the observed data distribution under retrospective
sampling. The idea is then to fit the logistic regression model:

\[
\mathrm{logit}\, P(Y^{(ij)} = 1 \mid X^{(ij)} = x, Z^{(ij)} = z, M^{(ij)} = m) = \alpha + \delta x + \beta z + \eta m + b^{(i)},
\]

where $b^{(i)}$ is a mean zero random effect, which expresses the contribution of matched
pair i. A maximum likelihood estimate of the remaining parameters, $(\alpha, \delta, \beta, \eta)$, can be
obtained via conditional logistic regression, for example by using the clogit function
in R. Under the "no confounding" conditions (8) and (10), the estimate of $\eta$, denoted by
$\hat{\eta}$, encodes the conditional causal effect of M on Y, represented in Equation (16) by the
symbol $\gamma$. Equation (17) then justifies the use of the following conditional score equation:

\[
0 = \sum_{i=1}^{n} \left(x^{(i1)} - x^{(i0)}\right) \exp\left(-\psi x^{(i1)} - \hat{\eta}\, m^{(i1)}\right) \tag{18}
\]

for estimating the direct effect of interest, which is encoded by $\psi$. An estimator for the
variance of $\hat{\psi}$ is derived in the last paragraph of Appendix 2.
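
Once an estimate of the mediator coefficient is available, solving the conditional score equation for the direct-effect parameter is a one-dimensional root-finding problem. The following is a minimal sketch, not the authors' implementation: it uses plain bisection, assumes the root lies inside a user-supplied bracket, and takes the mediator coefficient `eta_hat` as given (for instance from a conditional logistic fit). The function names and the toy data are hypothetical.

```python
import math


def score(psi, x_case, x_control, m_case, eta_hat):
    """Sum over pairs of (x_case - x_control) * exp(-psi*x_case - eta_hat*m_case)."""
    return sum(
        (x1 - x0) * math.exp(-psi * x1 - eta_hat * m1)
        for x1, x0, m1 in zip(x_case, x_control, m_case)
    )


def solve_psi(x_case, x_control, m_case, eta_hat, lo=-10.0, hi=10.0, tol=1e-10):
    """Find psi such that score(psi) = 0 by bisection on the bracket [lo, hi]."""
    f_lo = score(lo, x_case, x_control, m_case, eta_hat)
    f_hi = score(hi, x_case, x_control, m_case, eta_hat)
    if f_lo * f_hi > 0:
        raise ValueError("score does not change sign on the bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = score(mid, x_case, x_control, m_case, eta_hat)
        if f_lo * f_mid <= 0:
            hi = mid  # the root lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # the root lies in the upper half
    return 0.5 * (lo + hi)


# Toy matched pairs with a binary exposure; pairs with equal exposure
# in case and control contribute nothing to the score.
x_case = [1, 1, 0, 1, 0, 0]
x_control = [0, 0, 1, 1, 0, 1]
m_case = [25.0, 30.0, 22.0, 27.0, 24.0, 28.0]
eta_hat = 0.1  # assumed value, standing in for a fitted coefficient

psi_hat = solve_psi(x_case, x_control, m_case, eta_hat)
direct_rr = math.exp(psi_hat)
print(f"psi_hat = {psi_hat:.4f}, direct relative risk = {direct_rr:.4f}")
```

For a binary exposure the score is monotone in the parameter and the root has a closed form (the log ratio of two weighted counts of discordant pairs), which gives a convenient check on the numerical solution. For a general exposure the score need not be monotone, so the bracket should be chosen with care.
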

EXAMPLE 3: it is easy to show, along the lines of Example 2, that, for W = DEMO and
Z = BEHAVE, the causal diagram of Figure 2 satisfies conditions (7)-(10), as well as the
collapsibility condition $\mathrm{GENO} \perp\!\!\!\perp S \mid (\mathrm{MI}, \mathrm{BMI}, \mathrm{DEMO})$.
Because of the above considerations, and because early infarction is a rare disease,
we conclude that the direct effect of GENO on MI, controlling for BMI, is estimable by
G-estimation from matched case-control data, under the assumptions of Figure 2.

6 Back to our motivating study 

Within an Italian study in the genetics of infarction [2], cases were ascertained on the
basis of hospitalization for acute myocardial infarction between ages 40 and 45, during
the 1996-2002 period. This study involves the variables represented in Figures 1 and
2, which we continue to denote through the symbols introduced in Section 2. The
controls were selected by matching them to the cases over sex, geographical area of origin
and profession (DEMO).

Our aim here is to estimate, on the basis of the study data, the effect of genetic variation 
reflected by rs9939609 (GENO) on risk of early infarction (MI), controlling for body
mass (BMI). We work under the assumptions represented in the diagram of Figure
2, which appear legitimate, especially when one considers the narrow range of ages at 
infarction represented in our sample of cases. Under such assumptions, we have already 
seen in Example 3 that the direct effect of interest is estimable by using the algorithm 
described in the previous section. 

The distribution of the rs9939609 genotype in sample cases and controls is summarized in
Table 1. No major departure from Hardy-Weinberg equilibrium in controls was detected.

number of copies of the
major rs9939609 allele      controls    cases

0                           305         380
1                           889         921
2                           644         537

Table 1. Distribution of the rs9939609 genotype in sample cases and controls.
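
The source does not say which test was used to assess Hardy-Weinberg equilibrium. One common choice is a Pearson chi-square test with one degree of freedom, sketched below on the control counts of Table 1. The function name is hypothetical.

```python
def hwe_chi_square(n0, n1, n2):
    """Pearson chi-square statistic for Hardy-Weinberg equilibrium.

    n0, n1, n2 are the numbers of subjects carrying 0, 1 and 2 copies
    of the counted allele.
    """
    n = n0 + n1 + n2
    p = (2 * n2 + n1) / (2 * n)  # sample frequency of the counted allele
    q = 1.0 - p
    expected = (n * q * q, 2 * n * p * q, n * p * p)
    observed = (n0, n1, n2)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Control counts from Table 1 (copies of the major rs9939609 allele).
chi2_controls = hwe_chi_square(305, 889, 644)
print(f"chi-square = {chi2_controls:.4f}")
```

The statistic lies far below 3.84, the 5% critical value of a chi-square distribution with one degree of freedom, which agrees with the statement that no major departure was detected in controls.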

Table 2 summarizes results from the fitting of a conditional logistic model for the dependence
of occurrence of early myocardial infarction on wild-type rs9939609 homozygosity,
without any adjustment for other variables in the model (except, of course, for the
matching variables). This yielded an estimate of 0.76 for the total effect of rs9939609 wild-type
homozygosity on infarction, on an odds ratio scale, which is significantly different from
the null at a 0.0001 level of significance. This can be interpreted as evidence of an "overall
protective" effect of the major rs9939609 allele.

                                     OR      p-value    95% confidence interval

rs9939609 wild-type homozygosity?    0.76    0.0001     0.65 - 0.87

Table 2. Results from the fitting of a conditional logistic model for the dependence of
occurrence of early myocardial infarction on wild-type rs9939609 homozygosity, without any
adjustment for other variables in the model. This produces an estimate of the total effect of
rs9939609 wild-type homozygosity on susceptibility to early myocardial infarction, on an odds
ratio scale, reported in the OR column of the table.

                                     OR      p-value    95% confidence interval

rs9939609 wild-type homozygosity?    0.81    0.007      0.70 - 0.94
body mass index                      1.15    < 2e-16    1.12 - 1.17

Table 3. Results from the fitting of a conditional logistic model for the dependence of
occurrence of early myocardial infarction on wild-type rs9939609 homozygosity, adjusting for body
mass index.
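
As a rough plausibility check on the reported total effect, the crude odds ratio for carrying two copies of the major allele can be computed directly from the counts of Table 1. This sketch ignores the matching, so it is not the conditional logistic estimate reported by the authors; it only shows that the unmatched figure is of similar size. The function name is hypothetical.

```python
def crude_odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Unmatched odds ratio from a 2x2 table of exposure by case status."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


# Counts from Table 1, keyed by number of copies of the major rs9939609 allele.
cases = {0: 380, 1: 921, 2: 537}
controls = {0: 305, 1: 889, 2: 644}

# "Exposed" means homozygous for the major allele (two copies).
or_crude = crude_odds_ratio(
    cases[2],
    cases[0] + cases[1],
    controls[2],
    controls[0] + controls[1],
)
print(f"crude odds ratio = {or_crude:.3f}")
```

The crude value is roughly 0.765, close to the matched estimate of 0.76, and it confirms that the protective direction attaches to homozygosity for the major allele.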

When we further included body mass as an additional explanatory variable in the model, 
we obtained the results of Table 3, where the effect of rs9939609 wild-type homozygosity
on infarction, 0.81 on an odds ratio scale, significantly departs from the null at a 0.007
level of significance. Unfortunately, because conditions (3)-(4) are violated by the diagram
of Figure 2, we cannot take this estimate as a valid measure of the direct effect of rs9939609 
wild-type homozygosity on infarction, controlling for body mass. One problem here is, 
in fact, that physical exercise and drinking are potential confounders of the association 
between body mass and myocardial infarction. 

Can this problem be overcome by including the BEHAVE variables (physical exercise
and drinking habit) as additional covariates in the regression model? When we did so,
the estimated effect of rs9939609 wild-type homozygosity on infarction was 0.84, which is
a significant (at a 0.02 level) departure from the null (see Table 4). Again, because the
causal diagram of Figure 2 violates conditions (3)-(4), our method does not guarantee
that the above estimate, obtained by standard regression, is a valid measure of the direct
effect of interest. One problem is that conditioning on BEHAVE opens the
GENO → BEHAVE ← UNOBSERVED → MI path (see Appendix 1) and, as a consequence,
introduces a spurious, non-causal association between GENO and MI, so-called collider-stratification
bias. We must accept the fact that, according to our method, no valid
estimate of the direct effect of interest can be obtained by standard regression. Luckily,
because the causal diagram of Figure 2 satisfies conditions (7)-(10) and (15), our method tells us
that a valid estimate of the direct effect of rs9939609 on infarction, controlling for body
mass, can be obtained by using the G-estimation procedure of the preceding section. This
yields an estimate of 0.72, with a 95% confidence interval of (0.62, 0.84), on a relative risk
scale. This estimate differs appreciably from the estimates obtained in previous steps of
the analysis. The fact that the latter estimate refers to the relative risk scale, rather than
the odds ratio scale, does not entirely explain this difference in view of the low prevalence
of early-onset myocardial infarction.

                                     OR      p-value     95% confidence interval

rs9939609 wild-type homozygosity?    0.84    0.02        0.72 - 0.98
body mass index                      1.14    < 2e-16     1.11 - 1.16
occasional physical exercise?        0.61    1.41e-07    0.50 - 0.73
frequent physical exercise?          0.53    3.13e-13    0.44 - 0.63
drinking habit?                      1.36    7.48e-05    1.17 - 1.59

Table 4. Results from the fitting of a conditional logistic model for the dependence of
occurrence of early myocardial infarction on wild-type rs9939609 homozygosity, adjusting for body
mass index, physical exercise and drinking habit.

From a substantive point of view, our finding suggests that genetic variation represented 
by rs9939609 may influence heart disease via pathways different from those involved in 
body mass. A biological interpretation of this finding is given at the end of the next 
section. 

7 Discussion 

In this paper, we have started by examining conditions under which controlled direct 
effects can be estimated from prospective observational data via standard regression. 
When these conditions are violated, the direct effect of interest is sometimes still estimable 
from a prospective study, albeit not via regression. We have examined the more general 
conditions under which a controlled direct effect is estimable via G-computation, and we 
have expressed them as properties of a causal diagram representation of the problem. 
Then, in consideration of the increasing importance of matched case-control studies in 
genetic epidemiology, we have shifted attention to this class of studies. We have proposed 
an algorithm for the G-estimation of controlled direct effects from matched case-control 
studies, and characterized the necessary conditions for algorithm validity in terms of 
conditional independence properties of the causal diagram representation of the problem. 

The proposed method is also relevant in situations where the notion of "case" is not 
the usual one. Examples are offered by the papers of Cordell and colleagues [7], and
of Bernardinelli and colleagues [1], where genetic effects are estimated by conditioning
on parental genotypes, using data from proband-parent trios. These papers essentially 
perform a matched case-control analysis via conditional logistic regression, using the case 
and one or more "pseudocontrols" derived from the untransmitted parental haplotypes. 
This approach could be combined with the methods presented in this paper to assess 
direct controlled genetic effects. 

In the context of retrospective designs, further study is warranted of identification results 
for controlled direct effects in specific model classes, as well as for so-called "natural" direct 
and indirect effects [27] . In addition, further work is needed to investigate whether direct 
effect estimators can be constructed on the basis of matched case-control studies, which 
are either more efficient than the estimator proposed in this paper, or less dependent on a 
rare disease assumption. Finally, future work will also focus on inference under alternative 
strategies for the selection of controls in a retrospective study. 

We have illustrated the method with the aid of a study in the genetics of myocardial 
infarction. Our analysis detected presence of a direct effect of rs9939609 on infarction, 
controlling for body mass. This finding suggests that the effect of this SNP on suscepti- 
bility to infarction is not totally explained in terms of a deleterious effect of FTO on body 
mass. This finding points to a number of possible hypotheses. Very relevant here is re- 
cent evidence that SNPs can, in general, exert an influence on the expression of relatively 
distant (in terms of DNA stretch) genes. In our case, it could be that rs9939609 drives 
the expression of a gene other than FTO, functionally unrelated with FTO, whose effect 
on risk of infarction is not mediated by body mass, hence the direct effect. Such a
hypothesis is corroborated by biological evidence that FTO is located in a genomic
region containing highly conserved genomic regulatory blocks which, according to a
well-established theory, are likely to drive the expression of distant genes [SIEIIEI]. The above
considerations have useful implications with respect to possible experiments to elucidate 
the mechanism. It is not unlikely that rs9939609 may simultaneously drive the expres- 
sion of different, and functionally unrelated, genes. Such a multi-effect pattern could be 
common. For example, a recent study [18] shows that SNPs in the 9p21.3 region of DNA,
notoriously associated with susceptibility to infarction, not only control nearby genes, but 
also the expression of the quite distant IFNA21 gene. Generalizing on this example, one 
might conjecture that many SNPs exert their influence on disease susceptibility through 
non-overlapping pathways, and that this will, in many cases, result in evidence of direct 
and indirect effects that our method is able to capture. 

Appendix 1: Causal diagrams

Causal diagrams [8, 17, 25] consist of a set of nodes representing variables in the problem, and
directed arrows connecting pairs of nodes, as in Figure 1, for example. The same, elliptical,
shape is used for all nodes. In particular, no distinction is made, in terms of node shape, between
observed and unobserved variables/nodes, one reason being that this is not a distinction that
has to do with the causal structure of the system under study. The arrows represent direct
causal influence, in a sense to be made clear. A path is a sequence of distinct nodes where any
two adjacent nodes in the sequence are connected by an arrow. A directed path from a node A
to a node B is a path where all arrows connecting nodes on the path point away from A and
towards B. For example, in the graph of Figure 2, the sequence

GENO, BEHAVE, BMI, MI, UNOBSERVED 

is a path between GENO and UNOBSERVED, but not a directed one. 

If A has a directed path to B then A is an ancestor of B, and B a descendant of A. By 
convention, A is both an ancestor and a descendant of A. If an arrow points from A to B, then 
A is called a parent of B. In this paper, we restrict attention to causal diagrams which have the form of a
directed acyclic graph (DAG), that is, a directed graph where for any directed path from A to 
B, node B is not a parent of A. A probability distribution over the set of nodes of the graph is 
said to be Markov with respect to the graph if it can be expressed as a product of factors, where 
each factor is the conditional probability of a node of the graph, given its parents in the graph. 

A consecutive triple of nodes, A, B, C say, on a path is called a collider if the arrow between A
and B and the arrow between C and B both have arrowheads pointing to B. For example, in
Figure 1, node BEHAVE is a collider on the

GENO → BEHAVE ← UNOBSERVED

path. Any other consecutive triple is called a non-collider. A path between two nodes, A and
B say, is said to be blocked by a set C if either for some non-collider on the path, the middle
node is in C, or if the path contains a collider such that no descendant of the middle node of
such collider is in C. For example, in the graph of Figure 2, the path

GENO → BEHAVE → BMI → MI ← UNOBSERVED

is blocked by any set of nodes that contains either or both of (BEHAVE, BMI), and/or
contains neither S nor MI. In particular, the path is blocked by the empty set of nodes.

For disjoint sets A,B,C of nodes in a DAG we say A is d-separated from B given C if every 
path from a node in A to a node in B is blocked by C. If A is not d-separated from B given C, 
we say A is d-connected to B given C. For example, in the diagram of Figure 1b, nodes $\sigma_{BMI}$ 
and MI are d-separated given {GENO, BMI, BEHAVE, DEMO}. This is because all paths 
between $\sigma_{BMI}$ and MI contain at least one of the following non-colliders: 

(MI, BEHAVE, BMI), (UNOBSERVED, BEHAVE, BMI), 
(UNOBSERVED, DEMO, GENO), (UNOBSERVED, DEMO, BMI), 
(MI, GENO, BMI), 

all of which are blocked by virtue of the fact that BEHAVE, DEMO and GENO are in the 
conditioning set. As a further example, the reader is invited to check that $\sigma_{GENO}$ and MI 
are d-connected in the diagram of Figure 1b if BMI, but not GENO, is in the conditioning 
set. Two sets of nodes, A and B say, that are d-separated given a third set C, are conditionally 
independent, in a probabilistic sense, given C, under any distribution that is Markov with 
respect to the graph. By contrast, if A and B are d-connected given C, there exists some 
probability distribution which is Markov with respect to the graph, under which A and B are 
not conditionally independent, given C. 
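
The d-separation criterion can be checked mechanically. The sketch below does not enumerate paths. It uses a standard equivalent test: A and B are d-separated given C exactly when C separates them in the moralised graph of the ancestors of A, B and C. The toy graph is an assumption made for illustration: it contains only the single path GENO → BEHAVE → BMI → MI ← UNOBSERVED plus an arrow MI → S, and is not the full diagram of the paper's figures. The function names are likewise hypothetical.

```python
from collections import deque

# Toy DAG (an assumption for illustration): node -> list of its parents.
PARENTS = {
    "GENO": [],
    "UNOBSERVED": [],
    "BEHAVE": ["GENO"],
    "BMI": ["BEHAVE"],
    "MI": ["BMI", "UNOBSERVED"],
    "S": ["MI"],
}

def ancestors(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(a, b, cond, parents):
    """True if the disjoint node sets a and b are d-separated given cond."""
    a, b, cond = set(a), set(b), set(cond)
    keep = ancestors(a | b | cond, parents)
    # Moralise the ancestral subgraph: link each node to its parents,
    # and link every pair of parents that share a child.
    nbrs = {v: set() for v in keep}
    for child in keep:
        ps = parents[child]
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for p in ps:
            for q in ps:
                if p != q:
                    nbrs[p].add(q)
    # Breadth-first search from a that never steps onto a conditioning node.
    queue = deque(a - cond)
    seen = set(queue)
    while queue:
        v = queue.popleft()
        if v in b:
            return False
        for w in nbrs[v] - cond - seen:
            seen.add(w)
            queue.append(w)
    return True
```

On this toy graph, `d_separated({"GENO"}, {"UNOBSERVED"}, set(), PARENTS)` holds because MI is an unconditioned collider, whereas conditioning on MI or on its descendant S opens the path unless BEHAVE or BMI is also in the conditioning set.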

Appendix 2 

We now prove that, under the conditions stated in the main text, model (16) and a matched case-control sampling 
regime of the kind described in Section 5, the data approximately satisfy Equation (17), which 
we here repeat for the reader's convenience: 

$$E^{*}\left\{ \left(X^{(i1)} - X^{(i0)}\right) \exp\left(-\psi X^{(i1)} - \gamma M^{(i1)}\right) \right\} = 0, \tag{19}$$

where the expectation $E^{*}\{\cdot\}$ refers to the observed data distribution under retrospective 
sampling. 

Model (16) implies: 

$$\frac{E\{Y \mid \sigma_X = x, \sigma_M = m, W, Z\}}{E\{Y \mid \sigma_X = x, \sigma_M = 0, W, Z\}} = \exp(\gamma m),$$

from which we obtain: 

$$E\{Y \mid \sigma_X = x, \sigma_M = m, X = x, M = m, W, Z\} \exp(-\gamma m) = E\{Y \mid \sigma_X = x, \sigma_M = 0, W, Z\},$$

because for a generic variable H the equality $\sigma_H = h$ logically implies H = h; at least, this 
is true under the so-called consistency assumption that setting H to h by intervention has no 
effect amongst those for whom H = h is naturally observed. 
Thanks to the conditioning on X = x and M = m, we may now bring the $\exp(-\gamma m)$ factor of 
the left hand side into the expectation, and further multiply both sides of the equation by the 
factor $\exp(-\psi x)$, so as to obtain: 

$$E\left[Y \exp(-\psi x - \gamma m) \mid \sigma_X = x, \sigma_M = m, X = x, M = m, W, Z\right] = E\left[Y \exp(-\psi x) \mid \sigma_X = x, \sigma_M = 0, W, Z\right].$$

Then, by virtue of conditions (9)-(10), respectively, we can eliminate the conditioning on 
$\sigma_X = x$ and $\sigma_M = m$ from the left hand side of the equation, which leads to: 

$$E\{Y \exp(-\psi x - \gamma m) \mid X = x, M = m, W, Z\} = E\{Y \exp(-\psi x) \mid \sigma_X = x, \sigma_M = 0, W, Z\},$$

where the expectation at the left hand side is taken with respect to the population distribution 
(which is what the absence of the $\sigma$ indicators in the conditioning part means). From the above 
equation, by virtue of (16), we obtain: 

$$E\{Y \exp(-\psi x - \gamma m) \mid X = x, M = m, W, Z\} = E[Y \mid \sigma_X = 0, \sigma_M = 0, W, Z]. \tag{20}$$

The above equality implies that, conditionally on W and Z, the random variable 

$$Y \exp(-\psi X - \gamma M)$$

is, in expectation under the population distribution, independent of (X, M) and therefore, in a 
random sample from the population, the quantity: 

$$\left(X_i - E\{X\}\right) Y_i \exp(-\psi X_i - \gamma M_i) \tag{21}$$

has, conditionally on W and Z, zero mean at the true parameter values. 
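
The zero-mean property of quantity (21) can be verified exactly in a toy population. The sketch below is an illustration under stated assumptions, not the paper's data or code: X and M are binary, there are no covariates W or Z and no confounding, X affects M, and the risk follows the log-linear form P(Y = 1 | X, M) = p0 exp(ψX + γM). All numerical values and function names are hypothetical.

```python
import math

PSI, GAMMA, P0 = 0.4, 0.7, 0.01    # hypothetical true parameter values
P_X1 = 0.3                          # hypothetical P(X = 1)
P_M1_GIVEN_X = {0: 0.2, 1: 0.5}     # hypothetical P(M = 1 | X = x)

def cells():
    """Yield (x, m, P(X = x, M = m)) over the four (X, M) cells."""
    for x in (0, 1):
        px = P_X1 if x == 1 else 1 - P_X1
        for m in (0, 1):
            pm = P_M1_GIVEN_X[x] if m == 1 else 1 - P_M1_GIVEN_X[x]
            yield x, m, px * pm

def risk(x, m):
    """P(Y = 1 | X = x, M = m) under the assumed log-linear model."""
    return P0 * math.exp(PSI * x + GAMMA * m)

def mean_of_21(psi, gamma):
    """Exact population mean of (X - E X) Y exp(-psi X - gamma M)."""
    ex = P_X1  # E(X) for a binary X
    return sum(p * (x - ex) * risk(x, m) * math.exp(-psi * x - gamma * m)
               for x, m, p in cells())
```

At the true values, `mean_of_21(PSI, GAMMA)` is zero up to rounding error, whereas a wrong value of ψ, for example `mean_of_21(0.0, GAMMA)`, gives a non-zero mean. This is what makes the quantity usable as an estimating function for ψ.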

Recall that we are dealing with a sample from a 1-to-1 matched case-control study. For the 
affected member of the ith matched set, consider the quantity: 

$$E^{*}\left\{X^{(i1)} \exp\left(-\psi X^{(i1)} - \gamma M^{(i1)}\right) \mid W = W^{(i1)}\right\} = E\left\{XY \exp(-\psi X - \gamma M) \mid W = W^{(i1)}, Y = 1\right\}$$
$$= E\left\{XY \exp(-\psi X - \gamma M) \mid W = W^{(i1)}\right\} / P(Y = 1 \mid W = W^{(i1)}),$$

where the expectations $E\{\cdot\}$ are taken with respect to the population distribution. By virtue of 
the above independence property, the above equation can be rewritten as: 

$$E\left[X \mid W = W^{(i1)}\right] E\left[Y \exp(-\psi X - \gamma M) \mid W = W^{(i1)}\right] / P(Y = 1 \mid W = W^{(i1)}),$$

which, in the light of (20), can be written as: 

$$= E\left[X \mid W = W^{(i1)}\right] E\left[Y \mid \sigma_X = 0, \sigma_M = 0, W = W^{(i1)}\right] / P(Y = 1 \mid W = W^{(i1)}).$$
Further, note that by a similar reasoning 

$$E^{*}\left\{X^{(i0)} \exp\left(-\psi X^{(i1)} - \gamma M^{(i1)}\right) \mid W = W^{(i1)}\right\}$$
$$= E\left[X \mid W = W^{(i0)}, Y = 0\right] E\left[Y \exp(-\psi X - \gamma M) \mid W = W^{(i1)}, Y = 1\right]$$
$$= E\left[X \mid W = W^{(i0)}, Y = 0\right] E\left[Y \mid \sigma_X = 0, \sigma_M = 0, W = W^{(i1)}\right] / P(Y = 1 \mid W = W^{(i1)}).$$

Under a rare disease assumption, we have $E[X \mid W, Y = 0] \approx E[X \mid W]$, which gives 
Equation (19). Quod erat demonstrandum. 
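
The rare disease approximation used in the last step can be illustrated numerically. The sketch below assumes a binary X, no covariates, and a log-linear risk p0 exp(ψX). These choices, the numerical values and the function names are all hypothetical. It computes the exact mean of X among the unaffected and compares it with the population mean as the baseline risk p0 shrinks.

```python
import math

def mean_x_in_controls(p_x1, p0, psi):
    """Exact E[X | Y = 0] for binary X with risk p0 * exp(psi * x)."""
    r0 = p0                      # risk when X = 0
    r1 = p0 * math.exp(psi)      # risk when X = 1
    num = p_x1 * (1 - r1)        # P(X = 1, Y = 0)
    den = num + (1 - p_x1) * (1 - r0)   # P(Y = 0)
    return num / den

def gap(p0, p_x1=0.3, psi=0.4):
    """Absolute difference between E[X | Y = 0] and E[X]."""
    return abs(mean_x_in_controls(p_x1, p0, psi) - p_x1)
```

With these illustrative values the gap decreases roughly in proportion to p0, so for a rare outcome the mean of X among controls is close to the population mean, as the proof requires.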



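In the remaining part of this Appendix, we derive an estimator for the variance of the estimate 
of the parameter $\psi$. We start by defining $\theta = (\psi, \delta, \gamma, \beta)$ and let $U_i(\theta)$ be given 
by: 

$$U_i(\theta) = \begin{pmatrix} \left(x^{(i1)} - x^{(i0)}\right) \exp\left(-\psi x^{(i1)} - \gamma m^{(i1)}\right) \\ \begin{pmatrix} x^{(i1)} - x^{(i0)} \\ m^{(i1)} - m^{(i0)} \\ z^{(i1)} - z^{(i0)} \end{pmatrix} \operatorname{expit}\left(-\delta\left(x^{(i1)} - x^{(i0)}\right) - \gamma\left(m^{(i1)} - m^{(i0)}\right) - \beta\left(z^{(i1)} - z^{(i0)}\right)\right) \end{pmatrix}$$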



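Let $\hat{\theta}$ denote the estimate of $\theta$ obtained by our method. The variance of $\hat{\theta}$ is well approximated 
in large samples by the following sandwich estimator: 

$$\frac{1}{n}\, \mathbb{E}^{-1}\left\{\frac{\partial U_i(\theta)}{\partial \theta}\right\} \operatorname{Var}\left(U_i(\theta)\right) \mathbb{E}^{-1}\left\{\frac{\partial U_i(\theta)}{\partial \theta}\right\}^{T} \tag{22}$$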



where $\operatorname{Var}(U_i(\theta))$ can be estimated by calculating $U_i(\theta)$, then taking the sample variance of these 
contributions for all subjects, and finally evaluating at $\hat{\theta}$. The quantity $\mathbb{E}(\partial U_i(\theta) / \partial \theta)$ can be 
estimated by first calculating the gradient matrix $\partial U_i(\theta)/\partial \theta$ for each subject, evaluating it at 
$\hat{\theta}$ and then calculating the sample average (over all subjects) of each component of the matrix. 
In this gradient matrix, the element in the jth row and lth column should be the derivative of 
the jth component of $U_i(\theta)$ with respect to the lth component of $\theta$. The first diagonal element 
of the resulting matrix (22) gives the approximate variance of $\hat{\psi}$. 
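
The recipe for the sandwich estimator can be sketched generically. The code below is a minimal pure-Python illustration, not the authors' implementation: the function names are hypothetical, the gradient matrix is approximated by central finite differences instead of analytic derivatives, and the user supplies a function `u(theta, d)` returning one subject's (or matched set's) vector of estimating-function contributions.

```python
def mat_inv(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        d = aug[col][col]
        aug[col] = [v / d for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def sandwich(u, theta_hat, data, eps=1e-6):
    """Sandwich variance matrix of theta_hat, following the recipe above."""
    n, p = len(data), len(theta_hat)
    us = [u(theta_hat, d) for d in data]
    means = [sum(ui[j] for ui in us) / n for j in range(p)]
    # Sample covariance of the contributions, evaluated at theta_hat.
    var_u = [[sum((ui[j] - means[j]) * (ui[l] - means[l]) for ui in us) / (n - 1)
              for l in range(p)] for j in range(p)]
    # Sample average of the gradient matrix; entry [j][l] approximates the
    # derivative of component j of U_i with respect to component l of theta.
    grad = [[0.0] * p for _ in range(p)]
    for l in range(p):
        hi = list(theta_hat)
        hi[l] += eps
        lo = list(theta_hat)
        lo[l] -= eps
        for d in data:
            up, dn = u(hi, d), u(lo, d)
            for j in range(p):
                grad[j][l] += (up[j] - dn[j]) / (2 * eps) / n
    a_inv = mat_inv(grad)
    a_inv_t = [list(r) for r in zip(*a_inv)]
    cov = mat_mul(mat_mul(a_inv, var_u), a_inv_t)
    return [[c / n for c in row] for row in cov]
```

As a sanity check, with the one-dimensional estimating function `u(theta, d) = [d - theta[0]]`, whose solution is the sample mean, the result reduces to the sample variance divided by n. With θ ordered as (ψ, δ, γ, β), the entry `[0][0]` of the returned matrix would be the approximate variance of the estimate of ψ.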